Fast Structured Decoding for Sequence Models

Neural Information Processing Systems

Autoregressive sequence models achieve state-of-the-art performance in domains like machine translation. However, because of their autoregressive factorization, these models suffer from heavy latency during inference. Recently, non-autoregressive sequence models were proposed to reduce inference time. However, these models assume that the decoding of each token is conditionally independent of the others. Such a generation process sometimes makes the output sentence inconsistent, and thus the learned non-autoregressive models achieve inferior accuracy compared to their autoregressive counterparts. To improve decoding consistency while keeping the inference cost low, we propose to incorporate a structured inference module into non-autoregressive models. Specifically, we design an efficient approximation for Conditional Random Fields (CRF) for non-autoregressive sequence models, and further propose a dynamic transition technique to model positional contexts in the CRF. Experiments in machine translation show that, while adding little latency (8~14ms), our model achieves significantly better translation performance than previous non-autoregressive models on different translation datasets. In particular, on the WMT14 En-De dataset, our model obtains a BLEU score of 26.80, which largely outperforms the previous non-autoregressive baselines and is only 0.61 BLEU lower than purely autoregressive models.
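As a rough illustration of the consistency problem described above, the following Python sketch contrasts position-wise argmax decoding (the conditional-independence assumption) with Viterbi decoding under a first-order linear-chain CRF. The function names, the toy scores, and the repeat-penalty transition matrix are all assumptions made for this example. It uses a dense transition matrix and exact Viterbi, not the efficient approximation or the dynamic transitions that the paper proposes.

```python
import numpy as np


def independent_decode(emissions):
    """Pick the highest-scoring token at each position on its own."""
    return emissions.argmax(axis=1).tolist()


def viterbi_decode(emissions, transitions):
    """Return the best-scoring path under a first-order linear-chain CRF.

    emissions:   (T, V) array of per-position token scores
    transitions: (V, V) array, transitions[a, b] = score of token a followed by b
    """
    T, V = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, V), dtype=int)
    for t in range(1, T):
        # cand[a, b]: best path ending in token a at t-1, extended with b at t
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # Follow back-pointers from the best final token.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]


# Toy case (hypothetical numbers): both positions slightly prefer token 0,
# and the transition matrix penalizes emitting the same token twice in a row.
emissions = np.array([[2.0, 1.5, 0.0],
                      [2.0, 1.9, 0.0]])
transitions = -3.0 * np.eye(3)

print(independent_decode(emissions))           # repeats token 0
print(viterbi_decode(emissions, transitions))  # avoids the repetition
```

In this toy case the independent decoder emits the same token twice, while the CRF decoder trades a slightly lower emission score for a sequence that respects the pairwise preference.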


Reviews: Fast Structured Decoding for Sequence Models

The paper proposes to boost the translation quality of a non-autoregressive (NART) neural machine translation system with a conditional random field (CRF) attached to the decoder. The CRF reduces the drop in translation quality relative to autoregressive neural translation systems by imposing a bigram-language-model-like structure on the decoder, which helps to alleviate the strong independence assumption that NART architectures entail. The CRF is trained jointly with all other parameters of the neural network. Experiments on the WMT14 and IWSLT14 En-De and De-En tasks are reported to yield improvements of more than 6 BLEU points over the corresponding baselines. By augmenting the decoder with a first-order Markov CRF, the resulting network is, strictly speaking, no longer a non-autoregressive system.
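The difference between the two modeling assumptions can be written down in a generic form. The notation below (target length T, unary score s, pairwise score t, normalizer Z) is assumed for illustration and is the textbook linear-chain CRF, not necessarily the exact parameterization used in the paper.

```latex
% Non-autoregressive model: tokens are conditionally independent given x
p_{\mathrm{NART}}(y \mid x) = \prod_{i=1}^{T} p(y_i \mid x)

% First-order linear-chain CRF: adjacent tokens are coupled by a pairwise term
p_{\mathrm{CRF}}(y \mid x) = \frac{1}{Z(x)}
  \exp\!\Big( \sum_{i=1}^{T} s(y_i, x, i) + \sum_{i=2}^{T} t(y_{i-1}, y_i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{i=1}^{T} s(y'_i, x, i) + \sum_{i=2}^{T} t(y'_{i-1}, y'_i) \Big)
```

The pairwise term t(y_{i-1}, y_i) is what gives the model its bigram-like structure: each token is scored together with its left neighbor rather than in isolation.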


Reviews: Fast Structured Decoding for Sequence Models

The reviewers and I find the paper interesting, especially because such a simple approach performs favorably in comparison with non-autoregressive and expressive autoregressive models for machine translation. I recommend acceptance as a poster, given that the reviewers raise several concerns about the original manuscript. I ask the authors to change the title as agreed in the rebuttal, using terms such as low-latency, fast, etc. It seems that the paper uses an approximate partition function for training, which is not explained in detail. The theoretical properties of such an approximation may be interesting to study.
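To make the question about the approximate partition function concrete, here is a minimal Python sketch comparing the exact log-partition function of a linear-chain CRF (forward algorithm) with one possible approximation that keeps only the k highest-scoring tokens at each position. The top-k truncation scheme, the function names, and the dense transition matrix are assumptions chosen for illustration; the text above does not say which approximation the paper uses.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)


def log_partition(emissions, transitions):
    """Exact log Z via the forward algorithm, O(T * V^2) time."""
    alpha = emissions[0]
    for t in range(1, emissions.shape[0]):
        alpha = logsumexp(alpha[:, None] + transitions, axis=0) + emissions[t]
    return float(logsumexp(alpha, axis=0))


def log_partition_topk(emissions, transitions, k):
    """Approximate log Z using only the k best-emission tokens per position.

    This sums over a subset of all paths, so it never exceeds the exact value.
    Cost drops to O(T * k^2).
    """
    T, _ = emissions.shape
    keep = np.argsort(-emissions, axis=1)[:, :k]  # (T, k) retained token ids
    alpha = emissions[0, keep[0]]
    for t in range(1, T):
        trans = transitions[np.ix_(keep[t - 1], keep[t])]  # (k, k) sub-matrix
        alpha = logsumexp(alpha[:, None] + trans, axis=0) + emissions[t, keep[t]]
    return float(logsumexp(alpha, axis=0))


# Hypothetical random scores, only to compare the two quantities.
rng = np.random.default_rng(0)
em = rng.normal(size=(5, 50))
tr = rng.normal(size=(50, 50))
print(log_partition(em, tr), log_partition_topk(em, tr, k=8))
```

Because the truncated sum covers only a subset of paths, it is a lower bound on the exact log Z. How tight that bound is depends on how much probability mass falls outside the retained candidates, which is the kind of theoretical property the meta-review suggests studying.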

